The Annals of Applied Statistics 
2010, Vol. 4, No. 1, 396-421 
DOI: 10.1214/09-AOAS279 

© Institute of Mathematical Statistics. 2010 



VARIABLE SELECTION AND UPDATING IN MODEL-BASED 
DISCRIMINANT ANALYSIS FOR HIGH DIMENSIONAL 
DATA WITH FOOD AUTHENTICITY APPLICATIONS 
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and Adrian E. Raftery 2 ' 3 

University College Dublin, University of Glasgow and 
University of Washington, Seattle 

Food authenticity studies are concerned with determining if food 
samples have been correctly labeled or not. Discriminant analysis 
methods are an integral part of the methodology for food authenti- 
cation. Motivated by food authenticity applications, a model-based 
discriminant analysis method that includes variable selection is pre- 
sented. The discriminant analysis model is fitted in a semi-supervised 
manner using both labeled and unlabeled data. The method is shown 
to give excellent classification performance on several high-dimensional 
multiclass food authenticity data sets with more variables than ob- 
servations. The variables selected by the proposed method provide 
information about which variables are meaningful for classification 
purposes. A headlong search strategy for variable selection is shown 
to be efficient in terms of computation and achieves excellent classifi- 
cation performance. In applications to several food authenticity data 
sets, our proposed method outperformed default implementations of 
Random Forests, AdaBoost, transductive SVMs and Bayesian Multi- 
nomial Regression by substantial margins. 

1. Introduction. Foods that are expensive are subject to potential fraud 
where rogue suppliers may attempt to provide a cheaper inauthentic alterna- 
tive in place of the more expensive authentic food. Food authenticity studies 
are concerned with assessing the veracity of the labeling of food samples. 
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Discriminant analysis methods are of prime importance in food authentic- 
ity studies where samples whose authenticity is being assessed are classified 
using a discriminant analysis method and the labeling and classification are 
compared. Samples determined to have potentially inaccurate labeling can 
be sent for further testing to determine if fraudulent labeling has been used. 

Model-based discriminant analysis [Bensmail and Celeux (1996), Fraley 
and Raftery (2002)] provides a framework for discriminant analysis based on 
parsimonious normal mixture models. This approach to discriminant analy- 
sis has been shown to be effective in practice and, being based on a statistical 
model, it allows for uncertainty to be treated appropriately. 

In many applications, only a subset of the variables in a discriminant 
analysis contain any group membership information and including variables 
which have no group information increases the complexity of the analysis, 
potentially degrading the classification performance. Therefore, there is a 
need for including variable selection as part of any discriminant analysis 
procedure. Additionally, if a subset of variables is found to be important 
for classification purposes, then it suggests the potential for collecting a 
smaller subset of variables using inexpensive methods rather than the full 
high dimensional data. 

Variable selection can be completed as a preprocessing step prior to dis- 
criminant analysis (a filtering approach) or as part of the analysis procedure 
(a wrapper approach). Completing variable selection prior to the discrimi- 
nant analysis can lead to variables that have weak individual classification 
performance being omitted from the subsequent analysis. However, such 
variables could be important for classification purposes when jointly con- 
sidered with others. Hence, performing variable selection as part of the dis- 
criminant analysis procedure is preferred. 

Combining variable selection and linear or quadratic discriminant analy- 
sis has been considered previously in the literature; see McLachlan [(1992), 
Chapter 12] for a review. Many of these methods are based on measuring 
the Mahalanobis distance between groups before and after the inclusion of a 
variable into the discriminant analysis model. In the machine learning liter- 
ature, Kohavi and John (1997) developed a wrapper approach for combining 
variable selection in supervised learning, of which discriminant analysis is a 
special case. 

Variable selection is of particular importance in situations where there are 
more variables than observations available, that is, large p, small n (n<^p) 
problems [West (2003)]. These situations arise with increasing frequency 
in statistical applications, including genetics, proteomics, image processing 
and food science. The two food science applications considered in Section 2 
involve data sets with many more variables than observations. 

In this paper a version of model-based discriminant analysis is developed 
by adapting the model-based clustering with variable selection method of 
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Raftery and Dean (2006). This method of discriminant analysis builds a 
discriminant rule in a stepwise manner by considering the inclusion of extra 
variables into the model and also considering removing existing variables 
from the model based on their importance. The stepwise selection procedure 
is iterated until convergence. 

A brief review of model-based clustering and discriminant analysis is given 
in Section 3. The underlying model for model-based clustering with variable 
selection is reviewed in Section 3.1 and this model is extended to model- 
based discriminant analysis with variable selection in Section 3.2. In Sec- 
tion 3.3 the fitting of the discriminant analysis model is extended to in- 
corporate semi-supervised updating using both the labeled and unlabeled 
observations [Dean, Murphy and Downey (2006)] in order to improve the 
classification performance. 

Search strategies for selecting the variables for inclusion and exclusion 
are discussed in Section 3.4. A headlong search strategy is proposed that 
combines good classification performance and computational efficiency. The 
proposed methodology is applied to the high dimensional data sets in Sec- 
tion 4 and the methodology and results are discussed in Section 5. 

2. Data. 

2.1. Food authenticity and near infrared spectroscopy. An authentic food 
is one that is what it claims to be. Important aspects of food description 
include its process history, geographic origin, species/variety and purity. 
Food producers, regulators, retailers and consumers need to be assured of 
the authenticity of food products. 

Food authenticity studies are concerned with establishing whether foods 
are authentic or not. Many analytical chemistry techniques are used in 
food authenticity studies, including gas chromatography, mass spectroscopy 
and vibrational spectroscopic techniques (Raman, ultraviolet, mid-infrared, 
near-infrared and visible) . All of these techniques have been shown to be ca- 
pable of discriminating between certain sets of similar biological materials. 
Downey (1996) and Reid, O'Donnell and Downey (2006) provide reviews 
of food authenticity studies with an emphasis on spectroscopic methods. 
Near infrared (NIR) spectroscopy provides a quick and efficient method of 
collecting data for use in food authenticity studies [Downey (1996)]. It is 
particularly useful because it requires very little sample preparation and is 
nondestructive to the samples being tested. 

We consider two food authenticity data sets which consist of combined 
visible and near-infrared spectroscopic measurements from food samples of 
different types. The aim of the food authenticity study is to classify the 
food samples into known groups. The two studies are outlined in detail in 
Sections 2.2 and 2.3: 
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Fig. 1. The near-infrared spectra recorded for three examples of each meat species in the 
study. The discontinuity at 1100 nm is due to a sensor change at that value. The samples 
are colored as Beef= red, Lamb — green, Pork — blue, Turkey — orange, Chicken = yellow. 

• Classifying meats into species (Beef, Chicken, Lamb, Pork, Turkey). 

• Classifying olive oils into geographic origin (Crete, Peloponese, other). 

In both studies, combined visible and near infrared spectra were collected 
in reflectance mode using an NIRSystems 6500 instrument over the wave- 
length range 400-2498 nm at 2 nm intervals. The visible portion of the 
spectrum is the range 400-800 nm and the near-infrared region is the range 
800-2498 nm. Hence, the values collected for each food sample consist of 
1050 reflectance values taken at 2 nm intervals (see, for example, Figure 1). 
For the meat samples, twenty five separate scans were collected during a 
single passage of the spectrophotometer and averaged, after which the mean 
spectrum of a reference ceramic tile (16 scans) was recorded and subtracted 
from the mean spectrum. A similar process was used for the olive oil data, 
but fewer scans were used. Full details of the spectral data collection process 
are given in McElhinney, Downey and Fearn (1999) and Downey, Mclntyre 
and Davies (2003). 

The reflectance values in the visible and near-infrared region are produced 
by vibrations in the chemical bonds in the substance being analyzed. The 
data are highly correlated due to the presence of a large number of over- 
lapping broad peaks in this region of the electromagnetic spectrum and the 
presence of combinations and overtones. As a result, even though the data 
are very highly correlated, the reflectance values at adjacent wavelengths can 
have different sources and reflectance values at very different wavelengths 
can have the same source. So, the information encoded in each spectrum 
is recorded in a complex manner and spread over a range of locations. Os- 
borne, Fearn and Hindle (1993) provide an extensive review of the chemical 
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and technological aspects of near-infrared spectroscopy and its application. 
Further information on the combined spectra and their structure is given in 
Section 4 where the results of the analysis of the data are given. 

Because of the complex nature of the combined spectroscopic data, there 
is interest in determining if a small subset of reflectance values contain as 
much information for authentication purposes as the whole spectrum does. If 
a small number of variables contain sufficient information for authentication 
purposes, then this indicates the possibility of developing portable sensors 
for food authenticity studies that are more rapid and have a lower cost than 
recording the combined visible and near- infrared spectrum. In fact, portable 
sensors have been developed on a commercial basis for the authentication 
of Scottish whiskys [Connolly (2006)] using ultraviolet spectroscopic tech- 
nology. Hence, there are motivations for incorporating feature selection in 
the classification methods used on these data from the application and the 
modeling viewpoints. 

The problem of feature selection is especially difficult because the number 
of possible subsets of wavelengths that could be selected in this problem is 
2 1050 . So, efficient search strategies need to be used so that a good set of 
features can be selected without searching over all possible subsets. 

2.2. Homogenized meat data. McElhinney, Downey and Fearn (1999) 
constructed a collection of combined visible and near-infrared spectra from 
231 homogenized meat samples in order to assess the effectiveness of visible 
and near-infrared spectroscopy as a tool for determining the correct species 
of the samples. The samples collected for this study consist of 55 Chicken, 55 
Turkey, 55 Pork, 32 Beef and 34 Lamb samples. The samples were collected 
over an extended period of time and from a number of sources in order to 
ensure a representative sample of meats. 

For each sample, a spectrum consisting of 1050 reflectance measurements 
was recorded (as outlined in Section 2.1). A plot of all of the spectra is shown 
in Figure 1. We can see that there is a discrimination between the red meats 
(beef and lamb) and the white meats (chicken, turkey and pork) over some 
of the visible region (400-800 nm), but discrimination within meat colors is 
less clear. 

2.3. Greek olive oils data. Downey, Mclntyre and Davies (2003) recorded 
near-infrared spectra from a total of 65 extra virgin olive oil samples that 
were collected from three different regions in Greece (18 Crete, 28 Pelo- 
ponese, 19 other). Each data value consists of 1050 reflectance values over 
the visible and near-infrared range. The aim of their study was to assess the 
effectiveness of near-infrared spectroscopy in determining the geographical 
origin (see Figure 2) of the oils. 
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3. Model-based clustering and discriminant analysis. Model-based clus- 
tering [Banfield and Raftery (1993), Fraley and Raftery (1998, 2002), McLach- 
lan and Peel (2000)] uses mixture models as a framework for cluster analysis. 
The underlying model in model-based clustering is a normal mixture model 
with G components, that is, 



where f(-\fig, S s ) is a multivariate normal density with mean \x g and covari- 
ance T, g . 

A central idea in model-based clustering is the use of constraints on 
the group covariance matrices T, g ; these constraints use the eigenvalue de- 
composition of the covariance matrices to impose shape restrictions on the 
groups. The decomposition is of the form S 9 = \ g D g A g D^ , where X g is the 
largest eigenvalue, D g is an orthonormal matrix of eigenvectors, and A g is 
a diagonal matrix of scaled eigenvalues. Interpretations for the parameters 
in the covariance decomposition are as follows: X g = volume; A g = shape; 
D g = orientation. These parameters can be constrained in various ways to 
be equal or variable across groups. Additionally, the shape and orientation 
matrices can be set equal to the identity matrix. 

Bensmail and Celeux (1996) developed model-based discriminant analysis 
methods using the same covariance decomposition. An extension of model- 
based discriminant analysis that allows for updating of the classification 
rule using the unlabeled data was developed by Dean, Murphy and Downey 
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Fig. 2. Regions of Greece where the olive oil samples were collected. 
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(2006) and will be described in more detail in Section 3.3. Model-based 
clustering and discriminant analysis can be implemented in the statistics 
package R [R Development Core Team (2007)] using the mclust package 
[Praley and Raftery (1999, 2003, 2007)]. 

3.1. Model-based clustering with variable selection. We argue that vari- 
able selection needs to be part of the discriminant analysis procedure, be- 
cause completing variable selection prior to discriminant analysis may lose 
important grouping information. This argument is supported by the result 
of Chang (1983), who showed that the principal components corresponding 
to the larger eigenvalues do not necessarily contain information about group 
structure. This suggests that the commonly used filter approach of select- 
ing the first few principal components to explain a minimum percentage of 
variation can be suboptimal. A similar argument can be made that select- 
ing discriminating variables without reference to the grouping variable may 
miss important variables. In addition, some variables may contain strong 
group information when used in combination with other variables, but not 
on their own. Another critique of completing a variable (or feature) selec- 
tion step before supervised learning (filtering) is given by Kohavi and John 
(1997), Section 2.4. 

Raftery and Dean (2006) developed a stepwise variable selection wrapper 
for model-based clustering. With their method, variables are selected in a 
stepwise manner. Their method involves the stages: 

• A variable is proposed for addition to the set of selected clustering vari- 
ables. The Bayesian Information Criterion (BIC) is used to compare a 
model in which the variable contains extra information about the clus- 
tering beyond the information in the already selected variables versus a 
model where the variable doesn't contain additional information about the 
clustering beyond the information in the already selected variables. The 
variable with the greatest positive BIC difference is added to the model. 
If the proposed variable has a negative BIC difference, then no variable is 
added. 

• BIC is used to consider whether a variable should be removed from the 
model; This step is the reverse of the variable addition step. If all of the 
selected variables contain clustering information, then none is removed 
from the set of selected clustering variables. 

This process is iterated until no further variables are added or removed. This 
approach, that combines variable selection and cluster analysis, avoids the 
problems of completing variable selection independently of the clustering. 
While the stepwise variable selection wrapper proposed in Raftery and Dean 
(2006) and other wrapper approaches can give excellent clustering results, 
there is a considerable computational burden with wrapper approaches when 
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compared to filtering approaches; this is because the model needs to be fitted 
each time a variable is added or removed from the set of selected clustering 
variables. 

3.2. Model-based discriminant analysis with variable selection. We adapt 
the ideas of Raftery and Dean (2006) to produce a discriminant analysis tech- 
nique that includes a stepwise variable selection wrapper. This discriminant 
analysis method uses a stepwise variable selection procedure to find a subset 
of variables that gives good classification results. 

Each stage of the algorithm involves two steps: 

• Determine if a variable (not already selected) would contribute to the dis- 
criminant analysis model. In order to do this, a model comparison using 
BIC is used to compare a discriminant analysis model where the variable 
contains group information beyond the information in the already selected 
variables versus a model where the variable does not contain group infor- 
mation beyond the information in the already selected variables. Variables 
where the BIC difference is positive are candidates for addition to the set 
of selected variables; the procedure for searching for variables to add to 
the model is given in Section 3.4. 

• Determine if any selected variables should be removed from the discrimi- 
nant analysis model. This step is the reverse of the variable addition step. 
Variables where the BIC model comparison suggests that the variable does 
not contain group information are candidates for removing from the set 
of selected variables; the procedure for searching for variables to remove 
from the model is outlined in Section 3.4. 

Let (xi, X2, . . . ,x n ) be the observed data values and let (li, I2, . . . , l n ) be 
the group indicator variables for these observations where l{ g = 1 if observa- 
tion i belongs to group g and k g = otherwise. 

Suppose that the observation Xj is partitioned into three parts: x- c ^ are 
the variables already chosen; x^ is the variable being proposed; are 
the remaining variables. The decision on whether to include or exclude a 
proposed variable is based on the comparison of two models: 

. Grouping: p(x 4 |LJ = p(xf, x^, x 4 (o) |LJ = p(xf|x? \x ( ? ) )p(x?\x ( ? ) \l i ). 

• No Grouping: p(xj |LJ = p(x> c ^ , x^ p) , xj o) |LJ = p(xj o) |x^ p) , xf ) )p(x| p) |xf ' ) x 
P(x! c) |li). 

Figure 3 shows the difference between the "Grouping" and "No Grouping" 

(p) 

models for Xj. If the Grouping model holds, xj provides information about 

(c) 

which group the data value belongs to beyond that provided by x^ , while 

(p) 

if the No Grouping model holds, x^ provides no extra information. 
The Grouping and No Grouping models are specified as follows: 
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Table 1 

Constrained covariance structures in model-based clustering as implemented 
in the mclust package for R 



Model ID 


Volume 


Shape 


Orientation 


Covariance (S g ) 


EII 


Equal 


Equal spherical 


NA 


XI 


VII 


Variable 


Equal spherical 


NA 


\ g I 


EEI 


Equal 


Equal 


Axis aligned 


\A 


VEI 


Variable 


Equal 


Axis aligned 


\ g A 


EVI 


Equal 


Variable 


Axis aligned 


\A g 


VVI 


Variable 


Variable 


Axis aligned 




EEE 


Equal 


Equal 


Equal 


\DAD T 


EEV 


Equal 


Equal 


Variable 


XD a ADj 


VEV 


Variable 


Equal 


Variable 


\ g D g AD T g 


vvv 


Variable 


Variable 


Variable 


\ g D g A g D T g 



• Grouping: We let p(x. , xj |lj) be a normal density with parsimonious 
covariance structure as described in Table 1. That is, 

(x^,x^)|(^ = l)~iV(^ c ),S^ c )), 

lj ~ Multinomial(l, r), 

where r = (n, r 2 , . . . , tg)- 

• No Grouping: We let p(x^|lj) be a normal density with parsimonious 
covariance structure. In addition, p(x^|x^) is assumed to have a linear 




No Grouping Grouping 

Fig. 3. A graphical model representation of the Grouping and the No Grouping models. 



10 T. B. MURPHY, N. DEAN AND A. E. RAFTERY 

regression model structure. That is, 

^\(l ig = l)r, N (^\^), 

lj ~ Multinomial (l,r), 
xf ) |x i (c) ~iV(a + /3 T xJ c) , C T 2 ), 
where r = (n, r 2 , . . . , tq)- 

The same model structure is assumed for p(x^|x£ , ) in the Grouping 

model as in the No Grouping model. Therefore, this part of the model does 

(p) 

not influence the choice to include x^ in the model or not. 

The decision as to whether the Grouping or No Grouping model is ap- 
propriate is made using the BIC approximation of the log Bayes factor. The 
logarithm of the Bayes factor is 

(3.1) log(Bayes Factor) = l og P ^ M ^ 



p(xi\M NG ) 

where Mg is the Grouping model, M ng is the No Grouping model and 



p(*i\Mk) = J p(xi\9 k ,Mk)p(0k\Mk)d6 k 

is the integrated likelihood of model M k - We use the BIC approximation of 
the integrated likelihood in the form 

BIC = 2 x log maximized likelihood — dlog(n), 

where d is the number of parameters in the model and n is the sample size 
[Schwarz (1978)]. Following Raftery and Dean (2006), the log Bayes factor 
(3.1) can be reduced to 

vU {p) x (c) \Mr) 

log(Bayes Factor) = log ■ 



p{4 ) \^\m ng )p{^ ) \M ng ) 
(3.2) 

« - [BIC (Grouping) - BIC(No Grouping)], 

which only involves (xj c ^ , x^ ) and not x|°^ . Variables with a positive differ- 
ence in BIC(Grouping) — BIC(No Grouping) are candidates for being added 
to the model. 

At each variable addition stage, the BIC of the grouping model is calcu- 
lated using each of the ten covariance structures given in Table 1 and the 
model with the highest BIC is selected for the Grouping model for model 
comparison purposes. 
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At each stage, we also check if an already chosen variable should be re- 
moved from the model. This decision is made on the basis of the BIC dif- 

(p) 

ference in a similar way to previously. In this case, takes the role of the 

(c) 

variable to be dropped, takes the role of the remaining chosen variables 

and are the other variables. The variables with a positive difference in 
BIC (Grouping) — BIC (No Grouping) are candidates for removal from the 
model; in this case, the BIC for the no grouping models are computed for 
all covariance structures from Table 1 and the model with the highest BIC 
is selected as the No Grouping model. 

3.3. Discriminant analysis with updating. In standard discriminant anal- 
ysis, the unlabeled data are not used in the model fitting procedure. How- 
ever, these data contain information that is potentially important, especially 
when very few labeled data values are available. We can model both the la- 
beled and unlabeled data as coming from the same model, but where the 
unlabeled data is missing the labeling variable, this leads to a mixture model 
for the unlabeled data. Hence, the unlabeled data can then be used to help 
fit a model to the data. This idea has been investigated by many authors, 
including Ganesalingam and McLachlan (1978) and O'Neill (1978) and more 
recently by Dean, Murphy and Downey (2006), Chapelle, Scholkopf and Zien 

(2006) , Toher, Downey and Murphy (2007) and Liang, Mukherjee and West 

(2007) . 

Let (xi,1i),(x 2 ,12),...,(xtv,Lv) be the labeled data and yi,y2, • ■ • ,YM 
be the unlabeled data. We let z = (zi , z 2 , . . . , zm) be the unobserved (miss- 
ing) labels for the unlabeled data. In this framework, the Grouping and No 
Grouping models for the observed data are of the form: 

• Grouping: We let ^(x^ , x| \U) be a normal density with parsimonious 
covariance structure as described in Table 1, namely, 

lj ~ Multinomial ( 1, r). 

Also, p{yj 3 \y^) is a mixture of normals with parsimonious covariance 
structures, namely, 

9=1 

• No Grouping: We let p(x^|lj) be a normal density with parsimonious 
covariance structure, namely, 

^\(l ig = l)~N($\z¥), 

lj ~ Multinomial (1, r). 
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We also let p{y^ ) be a mixture of normal densities with parsimonious 
covariance structure, namely, 

9=1 

In addition, we assume a linear regression model for p{yif^ |xj c ^) and p(y^ \ 
yj c) ), namely, 

X l W |x< c W(a + /Fx£V) 

and 

yf |yf ~^(a + /3 T yi c V). 

In both models, we assume an identical model structure for p(x^ [x> , x^) 
and p(y^ \y\ C \ Yj)> an ^ this doesn't affect the choice to include a variable 
in the model or not. 

This model can be fitted using the EM algorithm [Dempster, Laird and 
Rubin (1977)] by introducing the missing labels z into the model. The cal- 
culations involved in fitting the model including the labeled and unlabeled 
data follow those outlined in Dean, Murphy and Downey (2006). The max- 
imum likelihood estimates for the regression part of the model correspond 
to least squares estimates of the regression parameters. 

The final estimates of the posterior probability of group memberships pro- 
duced by the EM algorithm are used to classify the unlabeled observations. 
Thus, each observation j is classified into the group g that maximizes Zj g 
over g, where 

. f aP (yf|A< c >,S< c) ) 

Z " E? ri VP(yf|A<?,i$>)' 

y^ is the set of chosen variables, and {(t 9 , p,g ,T, g ) : g = 1,2, ... ,G} are 
the maximum likelihood estimates for the unknown model parameters for 
this set of chosen variables. 

3.3.1. Example. An illustrative example of the BIC calculations when 
the proposed algorithm is applied to the meat spectroscopy data is shown in 
Figures 4-6; half the data of each type were randomly selected as training 
data in this example. 

The variable selection algorithm begins by selecting 626 nm as the wave- 
length with the greatest difference between the Grouping and No Grouping 
models (Figure 4) and the E covariance structure was chosen. It is worth 
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~~ 1 I I I I 

500 1000 1500 2000 2500 



Wavelength (nm) 

Fig. 4. A plot of the BIC difference for each wavelength. The wavelength with the greatest 
BIC difference is 626 nm. 

noting that wavelengths close to 626 nm still have strong evidence of group- 
ing even though the spectra are smoothly varying. This phenomenon is due 
to the fact that the spectrum consists of a number of overlapping peaks and 
the reflectances at adjacent locations can have different sources. As a result, 
extra grouping information can be available at wavelengths that are very 
close. 

Subsequently, the 814 nm wavelength is added to the model (Figure 5) 
and the EEV covariance structure was chosen. At the third stage, the 774 nm 




~~ i i i i i 

500 1000 1500 2000 2500 



Wavelength (nm) 

Fig. 5. A plot of the BIC difference for each wavelength given that wavelength 626 nm 
is already accepted. The wavelength with the greatest BIC difference is 814 nm. Note that 
wavelengths close to 626 nm still have positive BIC difference values. 
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— 1 1 1 1 1 — 

500 1000 1500 2000 2500 



Wavelength (nm) 

Fig. 6. A plot of the BIC difference for each wavelength given that the first two wave- 
lengths chosen (626 nm and 814 nm) are already accepted. The wavelength with the greatest 
BIC difference is 774 nm - 



Table 2 

A full example of the variable selection procedure used to classify the meat samples into 
five types. The updating procedure was used in this example 



Iteration 


Proposal 


BIC diff. 


Decision 


Proposal 




BIC diff. 


Decision 


1 


Add 626 nm 


425 


4 


Accepted 












2 


Add 814 nm 


274 


1 


Accepted 












3 


Add 774 nm 


427 


4 


Accepted 


Remove 774 


nm 


-427 


4 


Rejected 


4 


Add 664 nm 


142 


6 


Accepted 


Remove 626 


nm 


-120 


1 


Rejected 


5 


Add 680 nm 


220 


1 


Accepted 


Remove 774 


nm 


-78 


8 


Rejected 


6 


Add 864 nm 


165 


2 


Accepted 


Remove 774 


nm 


-91 


7 


Rejected 


7 


Add 602 nm 


118 


9 


Accepted 


Remove 774 


nm 


-26 


3 


Rejected 


8 


Add 794 nm 


118 


3 


Accepted 


Remove 774 


nm 


-86 


2 


Rejected 


9 


Add 702 nm 


178 


6 


Accepted 


Remove 774 


nm 


-127 


5 


Rejected 


10 


Add 1996 nm 


127 


5 


Accepted 


Remove 1996 


nm 


-127 


5 


Rejected 


11 


Add 644 nm 


76 


6 


Accepted 


Remove 644 


nm 


-76 


6 


Rejected 


12 


Add 2316 nm 


24 


1 


Accepted 


Remove 2316 


nm 


-24 


1 


Rejected 


13 


Add 2310 nm 


103 


2 


Accepted 


Remove 702 


nm 


-26 


1 


Rejected 


11 


Add 1936 nm 


10 


8 


Accepted 


Remove 702 


nm 


4 


4 


Accepted 


15 


Add 704 nm 


-3 


7 


Rejected 


Remove 1936 


nm 


-41 


3 


Rejected 



wavelength is selected (Figure 6) and the VEV covariance structure was cho- 
sen. The procedure continues until thirteen wavelengths are selected (details 
of the iterations are given in Table 2) and the VEV covariance structure is 
chosen at all subsequent stages. 

Interestingly, many of the chosen wavelengths are in the visible range 
(400-800 nm) of the spectrum, indicating that color is important when sep- 
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arating the meat samples. The closest two wavelengths that were selected 
were 2310 nm and 2316 nm and a number of wavelengths that were selected 
are approximately 20 nm apart. In summary, the selected wavelengths are 
spread out mainly in the visible region, but some wavelengths were selected 
in the near-infrared region. 

3.4. Headlong model search strategy. The variable selection algorithm 
demonstrated in Section 3.3.1 is a greedy search strategy. At the variable ad- 
dition stages of the algorithm, the variable with the greatest BIC difference 
is added and at variable removal stages, the variable with the greatest BIC 
difference is removed. The process of finding the variable with the greatest 
BIC difference involves calculating the BIC difference for all variables un- 
der consideration; for the spectroscopic data there are typically just under 
1050 variables under consideration at the variable addition stages. Hence, 
this search strategy is computationally demanding; this feature is shared by 
other wrapper variable selection methods too. 

A less computationally expensive alternative is to use a headlong search 
strategy [Badsberg (1992)]. The variable added or removed in the headlong 
search strategy need not be the best in terms of having the greatest BIC dif- 
ference; it merely needs to be the first variable considered whose difference 
is greater than some pre-specified value (here min. evidence). We found that 
min. evidence = gave good results for the applications in this paper. The 
headlong strategy has close connections to the "first-improvement" moves 
used in local search algorithms [e.g., Hoos and Stiitzle (2005), Chapter 2.1]. 
This means that instead of adding the variable with the greatest evidence 
for Grouping versus No Grouping, the first variable found to have a cer- 
tain amount of evidence for Grouping versus No Grouping would be added. 
At the variable addition stages of the algorithm, the remaining variables 
are examined in turn from an ordered list. The initial order of the list is 
based on the variables' original BIC differences at the univariate addition 
stage; this ordering was used in a similar context in Yeung, Bumgarner and 
Raftery (2005). We experimented with the initial ordering and also tried 
using increasing wavelength and decreasing wavelength. The classification 
performance was not sensitive to the initial ordering, but the selected vari- 
ables did depend on the ordering. In the context of increasing and decreasing 
wavelength, there was a bias toward selecting low and high wavelengths, re- 
spectively. 

Here is a summary of the algorithm: 

1. Select the first variable that is added to be the one that has the most 
evidence for Grouping versus No Grouping in terms of greatest BIC dif- 
ference (the same as the first step of the greedy search algorithm) . Create 
a list of the remaining variables in decreasing order of BIC differences. 
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2. Select the second variable that is added to be the first variable in the 
list of remaining variables with BIC difference for Grouping versus No 
Grouping, including the first variable selected, greater than min. evidence. 
Any variable checked and not used at this stage is placed at the end of 
the list of remaining variables. 

3. Select the next variable that is added to be the first variable in the list of 
remaining variables with BIC difference for Grouping versus No Group- 
ing, including the previous variables selected, greater than min. evidence. 
If no variable has BIC difference greater than min. evidence, then no vari- 
able is added at this stage. Any variable checked and not used at this stage 
is placed, in turn, at the end of the list of remaining variables. 

4. Check in turn each variable currently selected (in reverse order of in- 
clusion) for evidence of No Grouping (versus Grouping), including the 
other selected variables, and remove the first variable with BIC difference 
greater than min. evidence. If no variable has BIC difference greater than 
min. evidence, then no variable is removed at this stage. The removed 
variable is placed at the end of the list of other remaining variables. 

5. Iterate steps 3 and 4 until two consecutive steps have been rejected, then 
stop. 

4. Results. The proposed methodology was applied to the two food au- 
thenticity data sets outlined in Section 2.1. In each case, the data were split 
so that 50% of the data were used as labeled data and 50% as unlabeled. 
The methodology was applied to 50 random splits of labeled and unlabeled 
data and the mean and standard deviation of the classification rate were 
computed. 

The results were compared to previously reported performance results for 
these data and several widely used alternative techniques: Random Forests 
[Breiman (2001)], AdaBoost [Freund and Schapire (1997)], Bayesian Multi- 
nomial Regression [Madigan et al. (2005)], and Transductive Support Vector 
Machines [Vapnik (1995), Joachims (1999), Collobert et al. (2006)]. 

We used the default settings in the R [R Development Core Team (2007)] 
implementations of Random Forests (randomForest version 4.5-30) [Liaw 
and Wiener (2002)] and AdaBoost (adabag version 1.1) [Cortes, Martinez 
and Rubio (2007)]. The use of various parameter settings was explored, 
but the results did not vary to a large extent with respect to the choice 
of parameter values. For Bayesian Multinomial Regression we used cross 
validation to choose between the choice of prior variance values {10 p :p = 
—4, —3, —2, —1, 0, 1, 2, 3, 4} as suggested in Genkin, Lewis and Madigan (2005) 
For the Transductive SVM analysis we used the UniverSVM software version 
1.1 [Sinz and Roffilli (2007)] with a linear kernel and parameters (c, s,z) = 
(100,-0.3,0.1); other parameter values were considered, but the values re- 
ported yielded the best classifications. 
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Table 3 

Classification performance on the Meats data for the variable selection algorithm with 
updating and for previous analyses of these data. Mean classification performance for the 
50 random splits of the data are reported with standard deviations in parentheses 



Method 


Misclassiflcation rate 


Variable selection and updating 


6.1% (3.5) 


Variable selection (greedy) and updating 


5.1% (1.9) 


Variable selection only 


9.3% (3.6) 


Dean, Murphy and Downey (2006) 


5.6% (2.0) 


McElhinney, Downey and Fearn (1999) 


7.3%-13.9% 


Transductive SVMs 


42.6% (5.7) 


Random Forests 


20.1% (3.8) 


AdaBoost.Ml 


20.3% (4.8) 


Bayesian Multinomial Regression 


34.2% (5.8) 



4.1. Meats data. The results achieved on the homogenized meat data 
(Section 2.2) are reported in Table 3. These results show that the vari- 
able selection and updating method gives comparable or better performance 
than previous analyses of these data; an improved classification rate has 
been achieved relative to those achieved by McElhinney, Downey and Fearn 
(1999) who used factorial discriminant analysis (FDA), fc-nearest neighbors 
(A;NN), discriminant partial least squares regression (PLS) and soft inde- 
pendent modeling of class analogy (SIMCA). Furthermore, a comparable 
classification performance has been achieved relative to Dean, Murphy and 
Downey (2006) who used model-based discriminant analysis with updating 
on a reduced form of the data derived from wavelet thresholding. The vari- 
able selection and updating procedure gave substantially better performance 
than other competing methods for classification. 

An examination of the misclassiflcation table (Table 4) for the variable se- 
lection and updating method shows that many of the misclassifications were 
due to the difficulty in separating the chicken and turkey groups. Interest- 
ingly, no misclassifications were made between the red and white meats. 

The chosen wavelengths show us which parts of the spectrum are of im- 
portance when classifying samples into different species. We recorded the 
chosen wavelengths for each of the 50 sets of results and these are shown in 
Figure 7. We can see that a large proportion (51%) of the chosen wavelengths 
are in the visible region (400-800 nm), but some regions in the near-infrared 
spectrum are also chosen. Liu and Chen [(2000), Table 1] assign many of 
the spectral features in the visible part of the spectrum to different forms of 
myoglobin such as deoxymyoglobin (430, 440, 445 nm), oxymyoglobin (545, 
560, 575, 585 nm), metmyoglobin (485, 495, 500, 505 nm) and sulfmyo- 
globin (635 nm). Sulfmyoglobin is a product of the reaction of myoglobin 
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Fig. 7. Wavelengths chosen in the five meat classification problem for the variable selec- 
tion and updating method. The height of the bars shows how many times the wavelength 
was chosen in 50 random splits of the data. 



with H2S generated by bacteria, and Arnalds et al. (2004) found the region 
of the spectrum close to 635 nm to be important when separating the red 
and white meat samples. The peak at 1100 nm is the wavelength where the 
sensor changes in the near-infrared spectrometer and the peak at 1068 nm 
can be attributed to third overtones of C-H stretch mode and C-H combi- 
nation bonds from meat constituents other than oxymyoglobin [Liu, Chen 
and Ozaki (2000)]. The near infrared region consisting of wavelengths near 
1510 nm has been attributed to protein, and a cluster of chosen wavelengths 
is close to this region. In all cases, between 13 and 19 wavelengths were 
chosen for classification purposes. 

Table 4 

Average classification results for the different meat types for the variable 
selection and updating classification method 



Predicted 



Truth 


Beef 


Lamb 


Pork 


Turkey 


Chicken 


Beef 


98.6 


1.4 


0.0 


0.0 


0.0 


Lamb 


1.4 


98.6 


0.0 


0.0 


0.0 


Pork 


0.0 


0.0 


99.2 


0.5 


0.3 


Turkey 


0.0 


0.0 


0.0 


88.2 


11.8 


Chicken 


0.0 


0.0 


0.0 


11.1 


88.9 
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Table 5 

Classification performance on the Meats data for the variable selection algorithm with 
updating and for previous analyses of these data after combining the chicken and turkey 
into a poultry group. Mean classification performance for the 50 random splits 
of the data are reported with standard deviations in parentheses 



Method 


Misclassiflcation rate 


Variable selection and updating 


0.8% (1.3) 


Variable selection (greedy) and updating 


0.7% (0.7) 


Variable selection only 


1.8% (3.2) 


Dean, Murphy and Downey (2006) 


1.0% (0.9) 


McElhinney, Downey and Fearn (1999) 


2.6%-4.3% 


Transductive SVMs 


20.9% (8.0) 


Random Forests 


10.5% (3.3) 


AdaBoost.Ml 


14.7% (3.7) 


Bayesian Multinomial Regression 


17.2% (4.9) 



Following McElhinney, Downey and Fearn (1999) and Dean, Murphy and 
Downey (2006), we combined the chicken and turkey groups into a poultry 
group to determine how well we can classify the homogenized meat samples 
into four types. The classification results are reported in Table 5 and the mis- 
classifications from the variable selection method with updating are shown 
in Table 6. There is a significant improvement in classification performance 
from all of the methods. Again, the white and red meats are separated with 
zero error. 

The wavelengths chosen for the four group classification problem (Fig- 
ure 8) still have a substantial proportion chosen from the visible part of the 
spectrum (52%). In this application, between 13 and 21 wavelengths were 
chosen for classification purposes. The VEV covariance structure was chosen 
in almost every run as the final model for both the four and five group meat 
classification problems. 

Table 6 

Average classification results for the different meat types after 
combining the chicken and turkey into a poultry group. The results 
shown are for the variable selection and updating method 



Predicted 



Truth 


Beef 


Lamb 


Pork 


Poultry 


Beef 


98.2 


1.8 


0.0 


0.0 


Lamb 


2.7 


97.3 


0.0 


0.0 


Pork 


0.0 


0.0 


99.1 


0.9 


Poultry 


0.0 


0.0 


0.0 


100.0 



20 



T. B. MURPHY, N. DEAN AND A. E. RAFTERY 




1 1 1 1 

500 1000 1500 2000 2500 

Wavelength (nm) 

Fig. 8. Wavelengths chosen in the four meat classification problem for the variable se- 
lection and updating method. 

4.2. Greek olive oil data. The methods were applied to the Greek olive 
oil data (Section 2.3), with 50% of the data being treated as training data 
and 50% as test data. Fifty random splits of training and test data were used. 
The misclassification rates achieved on these data are reported in Table 7. 
Variable selection and updating provides one of the best classification rates 
for these data. Downey, Mclntyre and Davies (2003) did report a better mis- 
classification rate (6.1%) using factorial discriminant analysis (FDA), but 
the choice of a subset of wavelengths, data preprocessing method and clas- 
sification method (from partial least squares, factorial discriminant analysis 
and /c-nearest neighbors) was made with reference to the test data classifi- 
cation performance. In contrast, our model selection was done without any 
reference to the test data classification performance. 

A cross tabulation of the classifications with the true origin of the olive 
oils (Table 8) reveals the difficulty in classifying the oils. 

In contrast to the meat classification problem, the chosen wavelengths for 
this problem (Figure 9) are concentrated in the near-infrared region (800- 
2498 nm), but some wavelengths in the visible region are also selected. The 
most commonly chosen wavelength is 2080 nm, which has been attributed 
to an O-H stretching/O-H bend combination [Osborne et al. (1984)]. Wave- 
lengths near 2310, 2346 and 2386 nm are due to C-H stretching vibrations 
and other vibrational modes. In particular, wavelengths in the 2310 nm re- 
gion have previously been assigned to fat content. In all cases, between 6 and 
29 wavelengths were selected, with a mean of 15 wavelengths being chosen. 
The EEE covariance structure was chosen for every final model for the olive 
oil classification problem. 
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4.3. Sensitivity to spectral resolution. In order to determine the sensitiv- 
ity of the selected wavelengths to the resolution of the spectrometer used in 
this study, we investigated the effect of reducing the number of reflectance 
values by computing the mean reflectance value over sets of adjacent wave- 
lengths and using these as inputs into the variable selection model. The 
results of this analysis are outlined for the olive oil authentication problem, 
and similar results were found for the meat species authenticity study. 

We found that the classification error of the olive oil samples increases 
slightly as soon as any adjacent wavelengths are aggregated (Table 9). How- 
ever, once the wavelengths are aggregated, the classification error remained 
steady for aggregating between 2 and 30 adjacent wavelengths. Thereafter, 
there was a serious deterioration in the classification performance when more 
than 30 adjacent wavelengths were aggregated. This suggests that a consider- 

Table 7 

Classification performance on the Olive Oil data for the variable selection algorithm with 
updating and for previous analyses of these data. Mean classification performance for the 
50 random splits of the data are reported with standard deviations in parentheses. For the 
variable selection only results, the maximum number of selected wavelengths was 
restricted to be six to avoid degeneracies 



Method 


Misclassification rate 


Variable selection and updating 


6.9% (5.4) 


Variable selection (greedy) and updating 


16.6% (11.3) 


Variable selection only 


17.9% (10.9) 


Dean, Murphy and Downey (2006) 


11.9% (6.3) 


Downey, Mclntyre and Davies (2003) 


6.1%-19.0% 


Transductive SVMs 


12.4% (7.5) 


Random Forests 


19.3% (6.5) 


AdaBoost.Ml 


34.1% (9.3) 


Bayesian Multinomial Regression 


57.0% (1.2) 



Table 8 

Average classification results for the olive oil groups. 
The results shown are for the variable selection 
and updating method 



Truth 




Predicted 




Crete 


Peleponese 


Other 


Crete 


90.0 


8.7 


1.3 


Peleponese 


1.0 


92.9 


6.1 


Other 


0.0 


3.8 


96.2 
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Table 9 

The change in classification performance for the variable 
selection and updating method as the number of adjacent 
wavelengths being aggregated increases 



Aggregation level 


Classification error 


1 


6.9% 


2 


9.7% 


3 


7.6% 


5 


7.9% 


10 


9.9% 


15 


8.5% 


30 


9.1% 


50 


13.2% 


70 


28.7% 



able amount of the group information is maintained at even low resolutions, 
but that there is more information in the raw data themselves. 

The spectral regions selected when analyzing the data in aggregated form 
were found to be stable. In both applications, the selected regions were very 
similar for the aggregated data, but fewer variables tended to be selected 
because of the aggregation process. Figure 10 shows the chosen wavelengths 
when the raw spectra, two adjacent wavelengths and three adjacent wave- 
lengths are aggregated and then analyzed for the olive oil classification prob- 
lem. This shows that the selection procedure chooses very specific spectral 
regions on both the raw and aggregated scale. 
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Fig. 9. Wavelengths chosen in the olive oil classification problem using the variable se- 
lection and updating method. The height of the bars shows how many times the wavelength 
was chosen in 50 random splits of the data. 
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Fig. 10. The chosen wavelengths when the raw olive oil spectra are analyzed and when 
adjacent wavelengths are aggregated. 

5. Discussion. The discriminant analysis method presented in this pa- 
per gave much better results than those given by popular statistical and 
machine learning techniques such as Random Forests [Breiman (2001)], Ad- 
aBoost [Freund and Schapire (1997)] and Bayesian Multinomial Regression 
[Genkin, Lewis and Madigan (2005), Madigan et al. (2005)] and Transduc- 
tive SVMs [Vapnik (1995), Joachims (1999)] for the high-dimensional food 
authenticity data sets analysed here. This improvement is further enhanced 
by the addition of the updating procedure for including the unlabeled data 
in the estimation method. The results show that the headlong search method 
for variable selection is an efficient method for selecting wavelengths. 

In addition to the improvement in classification results in the example 
data sets given, the number of variables needed for classification was sub- 
stantially reduced from 1050 to less than thirty. The variable selection results 
in the food authenticity application suggest the possibility of developing au- 
thenticity sensors that only use reflectance values over a carefully selected 
subset of the near-infrared and visible spectral range. The regions of the 
spectrum selected by the method can be interpreted in terms of the under- 
lying chemical properties of the foods under analysis. 

We have compared our method with four established leading classification 
methods from statistics and machine learning for which standard software 
implementations are available. One of these, AdaBoost, was identified by Leo 
Breiman as "the best off-the-shelf classifier in the world" [Hastie, Tibshirani 
and Friedman (2001)]. It is possible that the large improvement in perfor- 
mance of our method relative to the established methods we have compared 
it with is due to the fact that our data have many variables of which only a 
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very small proportion (l%-3%) are useful. The variables that are not useful 
may introduce a great deal of noise and degrade performance, and so other 
methods that do not reduce the number of variables may suffer from this. 

Although the methods were developed for the food authenticity applica- 
tion outlined herein, the method could be applied in contexts such as the 
analysis of gene expression data and document classification. The results of 
the variable selection procedure could mean a substantial savings in terms 
of time for data collection and space for future data storage. 

A range of recent approaches to variable selection in a classification con- 
text include the DAL ASS approach of Trendafilov and Jolliffe (2007), vari- 
able selection for kernel Fisher discriminant analysis [Louw and Steep (2006)] 
and the stepwise stopping rule approach of Munita, Barroso and Oliveira 
(2006). A number of different search algorithms (proposed as alternatives 
to backward/forward/stepwise search) wrapped around different discrimi- 
nant functions are compared by Pacheco et al. (2006), and genetic search 
algorithms wrapped around Fisher discriminant analysis are considered by 
Chiang and Pell (2004). Another example of variable selection methods in 
the context of classification using spectroscopic data is given by Indahl and 
Naes (2004). 

In terms of other approaches to variable selection, a good review of re- 
cent work on the problem of variable or feature selection in classification was 
given by Guyon and Elisseeff (2003) from a machine learning perspective. A 
good review of methods involving Support Vector Machines (SVMs) (along 
with a proposed criterion for exhaustive variable selection) is given by Mary- 
Huard, Robin and Daudin (2007). An extension allowing variable selection 
for the multiclass problem using SVMs is given by Wang and Xiatong (2007). 
An alternative approach for combining pairwise classifiers, based on Hastie 
and Tibshirani (1998), is given by Szepannek and Weihs (2006). Greenshtein 
(2006) looks at theoretical aspects of the n<p classification and variable 
selection problem in terms of empirical risk minimization subject to l\ con- 
straints. Finally, an alternative to single subset variable selection through 
Bayesian Model Averaging [Madigan and Raftery (1994)] is given by Dash 
and Cooper (2004). 

Acknowledgments. We would like to thank the Editor, Associate Editor 
and Referees whose suggestions greatly improved this paper. We would also 
like to thank Gerard Downey for providing the food authenticity data and 
for help with interpreting the results of the analysis. 

SUPPLEMENTARY MATERIAL 

Supplement: Data (DOI: 10.1214/09-AOAS279SUPP; .zip). This zipfile 
[Murphy, Dean and Raftery (2009)] contains the data sets used in this paper. 
The original data source information and conditions for the use of the data 
are outlined in this file. 
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